Load the cleaned data produced by the previous steps in the
data_preparation.rmd file.
Create a correlation matrix to understand the relationships between variables.
# Select only numeric columns for correlation
numerical_cols <- koi_data %>%
select(
koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
) %>%
drop_na()
# Calculate the correlation matrix
cor_matrix <- cor(numerical_cols)
# Visualize the correlation matrix
ggcorrplot(cor_matrix,
hc.order = TRUE, # Hierarchical clustering
type = "upper", # Show upper triangle
lab = TRUE, # Show correlation coefficients
lab_size = 3, # Adjust label size
method = "circle", # Use circles to represent correlation
colors = c("#6D9EC1", "white", "#E46726")
) # Specify color scheme
The correlation matrix shows strong relationships between several
variables. For example, the correlation
between koi_period and koi_duration is 0.99,
indicating a very strong positive relationship. This suggests that as
the orbital period increases, the transit duration also tends to
increase.
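Reading individual coefficients off a 20 x 20 plot is error-prone, so it can help to list the strongest pairs directly. The following is a minimal sketch on simulated stand-in data: the data frame `toy` and its columns `a`, `b`, `c` are hypothetical and are not part of the KOI dataset.

```r
set.seed(1)

# Simulated stand-in data: b is nearly a copy of a, c is unrelated
n <- 200
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.1)
toy$c <- rnorm(n)

cm <- cor(toy)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)

# Sort pairs from strongest to weakest absolute correlation
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

On the real data, `cm` would presumably be replaced by `cor_matrix` from the chunk above.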
Perform PCA on the selected numerical variables.
numerical_pca_cols <- koi_data %>%
select(
koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
)
disposition_col <- koi_data$koi_pdisposition
pca_data_complete <- numerical_pca_cols %>% drop_na()
disposition_complete <- disposition_col[complete.cases(numerical_pca_cols)]
if (length(disposition_complete) != nrow(pca_data_complete)) {
stop("Mismatch between data rows and disposition labels after handling NAs.")
}
# Scale the Data (Standardize)
scaled_pca_data <- scale(pca_data_complete)
pca_result <- prcomp(scaled_pca_data, center = FALSE, scale. = FALSE)
Shows the proportion of variance explained by each component.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8409 1.7355 1.6688 1.5467 1.24685 1.12924 1.09109
## Proportion of Variance 0.1694 0.1506 0.1393 0.1196 0.07773 0.06376 0.05952
## Cumulative Proportion 0.1694 0.3200 0.4593 0.5789 0.65663 0.72039 0.77992
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.9317 0.83983 0.8246 0.6914 0.66262 0.62824 0.52575
## Proportion of Variance 0.0434 0.03527 0.0340 0.0239 0.02195 0.01973 0.01382
## Cumulative Proportion 0.8233 0.85858 0.8926 0.9165 0.93844 0.95817 0.97199
## PC15 PC16 PC17 PC18 PC19 PC20
## Standard deviation 0.45004 0.41454 0.33987 0.23551 0.10610 0.05948
## Proportion of Variance 0.01013 0.00859 0.00578 0.00277 0.00056 0.00018
## Cumulative Proportion 0.98212 0.99071 0.99649 0.99926 0.99982 1.00000
From the eigenvalues, we can see that the first two principal components explain approximately 32% of the total variance, so they do not capture much of the variability in the data. The first 11 principal components are needed to exceed 90% of the variance, suggesting that the underlying structure of the data (based on these numerical variables) is quite complex: no simple, low-dimensional linear subspace captures most of the information.
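The cumulative figures can be re-derived from the standard deviations alone. The sketch below copies the rounded standard deviations from the summary output, so it agrees with the printed table only to rounding precision; the names `sdev`, `prop_var`, `cum_var`, and `n_needed` are illustrative.

```r
# Standard deviations copied (rounded) from the PCA summary output
sdev <- c(1.8409, 1.7355, 1.6688, 1.5467, 1.24685, 1.12924, 1.09109,
          0.9317, 0.83983, 0.8246, 0.6914, 0.66262, 0.62824, 0.52575,
          0.45004, 0.41454, 0.33987, 0.23551, 0.10610, 0.05948)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 90%
n_needed <- which(cum_var >= 0.90)[1]

round(cum_var[c(2, 10, 11)], 4)
n_needed
```

With the fitted object at hand, the same vector is available as `pca_result$sdev`, so no copying is needed.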
Show how the original variables contribute to each PC using the rotation matrix. The loadings tell us how much each original variable contributes to each principal component. Larger absolute values mean stronger influence. The sign (+/-) indicates the direction of the relationship between the variable and the component.
## PC1 PC2 PC3 PC4 PC5
## koi_period 0.30639694 0.347714363 0.274495160 -0.008366897 0.01847917
## koi_duration 0.04269443 0.231972796 0.115843053 -0.015138855 -0.07234090
## koi_depth -0.12450017 0.127549414 -0.052461072 -0.140470095 -0.62521527
## koi_prad -0.15286222 0.058554951 0.120548664 -0.435568096 0.11140140
## koi_teq -0.42635722 -0.120328101 0.183977631 0.122937219 0.02213583
## koi_insol -0.14607512 -0.123562878 0.304706726 0.046459281 -0.09016303
## koi_model_snr -0.11924810 0.147469117 -0.048934105 -0.089975397 -0.62370781
## koi_steff -0.26437111 0.362258812 -0.072322603 0.209410982 0.07667573
## koi_slogg 0.25605738 -0.025127674 -0.435100454 -0.115861946 0.02756831
## koi_srad -0.17033395 -0.129612484 0.444175616 0.034644448 -0.08661357
## koi_smass -0.29636009 0.203438378 0.260224242 0.222440536 0.09382075
## koi_impact -0.19050425 0.118816716 0.001398144 -0.508153075 0.22162035
## koi_ror -0.17094262 0.131493527 0.009384989 -0.548172046 0.04619732
## koi_srho 0.06774909 0.066623582 0.081594434 -0.177212392 0.11099283
## koi_sma 0.30796798 0.361668040 0.286202772 0.001518795 0.01479373
## koi_incl 0.25332897 -0.005414288 0.049045471 0.057546845 -0.23841575
## koi_dor 0.29903665 0.280085628 0.250674161 -0.041381312 0.02445484
## koi_ldm_coeff1 0.21466163 -0.409881812 0.244745270 -0.174726148 -0.08714585
## koi_ldm_coeff2 -0.16513550 0.376579150 -0.276071465 0.151030824 0.10000747
## koi_smet 0.06242363 -0.095393670 0.123653316 0.091883359 0.17810853
## PC6 PC7 PC8 PC9 PC10
## koi_period 0.119360611 0.051875537 -0.06019383 -0.02002505 -0.18962741
## koi_duration 0.570839165 -0.215695497 0.20280440 0.23681782 0.55006942
## koi_depth -0.068584639 -0.079795012 0.11874165 0.06465849 -0.13349936
## koi_prad -0.131240154 -0.047792214 -0.09561726 -0.12698835 0.32123182
## koi_teq -0.019868440 0.150918255 0.13576536 -0.03384746 -0.22829643
## koi_insol 0.041738202 0.369541809 -0.31817638 0.64825953 -0.02700493
## koi_model_snr -0.054104702 -0.108447734 0.11403543 0.07830158 -0.08374526
## koi_steff -0.126839948 -0.122390073 -0.02680701 0.03044942 -0.06753539
## koi_slogg 0.006172822 0.078111944 -0.06601230 0.34949241 -0.13511190
## koi_srad 0.038663393 0.194322150 -0.15544735 -0.03160191 0.15628893
## koi_smass -0.160901331 -0.307606379 0.06812431 -0.12983491 -0.02450837
## koi_impact 0.093935585 -0.062617207 -0.13387480 0.03102445 -0.17452222
## koi_ror -0.050116478 -0.081455459 -0.18216056 0.02736957 -0.09446960
## koi_srho -0.487708565 0.284586266 0.61362328 0.20829062 0.32003174
## koi_sma 0.097524343 -0.007427604 -0.09031723 -0.02579694 -0.09548265
## koi_incl -0.460956458 -0.064815858 -0.51388871 -0.15445623 0.37733221
## koi_dor -0.175128087 0.185140235 0.14716568 -0.01272897 -0.31841721
## koi_ldm_coeff1 0.066040003 -0.172049829 0.13495052 -0.07410840 -0.09869411
## koi_ldm_coeff2 -0.062357540 0.202944864 -0.15945606 0.12053377 0.13967629
## koi_smet -0.281578572 -0.645526832 -0.01668499 0.51474574 -0.09385791
## PC11 PC12 PC13 PC14 PC15
## koi_period -0.11820578 0.061591783 -0.01431356 0.076013208 0.459993894
## koi_duration 0.12960162 0.109252424 0.00432354 0.066515371 -0.103379379
## koi_depth -0.04267725 0.396045760 0.54672910 -0.064810340 -0.012008518
## koi_prad -0.73163329 0.182306235 -0.15406092 -0.005183395 -0.095637702
## koi_teq -0.12060014 0.185245607 -0.00728660 0.303339231 0.376869420
## koi_insol 0.04266643 0.177584952 -0.20029275 0.179851481 -0.183178017
## koi_model_snr -0.09554845 -0.510837854 -0.49530116 0.042730159 0.058988510
## koi_steff 0.10671698 0.361573273 -0.35262834 -0.486478946 0.016285554
## koi_slogg -0.11500813 0.186729780 -0.12435748 -0.334522107 0.120528755
## koi_srad -0.01654610 -0.328290690 0.26449586 -0.659210103 0.102248018
## koi_smass 0.15456549 0.084960880 -0.08450251 0.080795833 -0.200376338
## koi_impact 0.32277566 -0.149681992 -0.01361718 -0.031637396 0.047790992
## koi_ror 0.28721255 0.004764452 0.09098955 0.080315291 0.004165243
## koi_srho 0.20516679 -0.005572167 -0.02150910 -0.021955538 0.210846592
## koi_sma -0.04104112 0.017431854 -0.02271675 0.016874341 0.262161956
## koi_incl 0.25593040 0.131709318 -0.05710512 0.143027074 0.077333599
## koi_dor -0.07984545 -0.055057826 0.05838449 -0.012100698 -0.627522154
## koi_ldm_coeff1 0.06933213 0.130826127 -0.13079360 -0.004982672 -0.011208722
## koi_ldm_coeff2 -0.13521739 -0.297682322 0.30880508 0.202161085 0.004390644
## koi_smet -0.18008029 -0.194697850 0.21481309 0.003904580 0.092571406
## PC16 PC17 PC18 PC19
## koi_period 0.03802923 -0.029875167 -0.021412450 6.437068e-01
## koi_duration -0.29905083 -0.103337964 0.096615121 2.303355e-03
## koi_depth 0.16802341 0.069849268 0.112401644 1.515136e-03
## koi_prad 0.06895291 0.002454426 0.048781159 2.186547e-03
## koi_teq -0.52158913 -0.189546320 0.179027991 -1.908513e-01
## koi_insol 0.22205127 0.087711186 -0.042019739 4.009233e-02
## koi_model_snr -0.03694752 -0.039322810 0.004593793 -4.513809e-03
## koi_steff -0.16381831 0.350713071 -0.051366675 1.855004e-02
## koi_slogg -0.07264963 -0.607563323 0.095505260 -7.671468e-02
## koi_srad -0.09304371 -0.165404261 -0.006564376 8.813710e-03
## koi_smass 0.38913201 -0.600489892 0.005196951 7.156391e-02
## koi_impact 0.11422003 0.096604457 0.653988234 -8.810851e-05
## koi_ror -0.24061955 -0.120915818 -0.654949610 -1.554008e-03
## koi_srho 0.10595934 0.023054427 -0.010784763 1.216639e-03
## koi_sma 0.22260594 0.040643864 -0.077415692 -7.320998e-01
## koi_incl -0.24358061 -0.075401764 0.220602020 -2.777929e-03
## koi_dor -0.38968330 -0.050942658 0.132763452 2.187147e-03
## koi_ldm_coeff1 0.03353477 -0.044421110 -0.002719654 -7.414143e-03
## koi_ldm_coeff2 0.01801750 -0.069436056 0.011572288 -8.901582e-03
## koi_smet -0.12982706 0.130710575 0.017994995 -5.298896e-03
## PC20
## koi_period 6.044490e-03
## koi_duration 5.976220e-04
## koi_depth -1.064006e-03
## koi_prad 1.674146e-04
## koi_teq 2.041897e-03
## koi_insol -1.814064e-03
## koi_model_snr 5.295556e-03
## koi_steff 2.245508e-01
## koi_slogg -1.346323e-05
## koi_srad 1.387013e-02
## koi_smass -9.313680e-03
## koi_impact 5.237360e-03
## koi_ror -5.471374e-03
## koi_srho 3.722707e-04
## koi_sma -5.138471e-03
## koi_incl -4.704890e-04
## koi_dor -2.463456e-04
## koi_ldm_coeff1 7.603596e-01
## koi_ldm_coeff2 6.073264e-01
## koi_smet -4.634635e-02
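Scanning 20 columns of loadings by eye is tedious, so ranking the absolute loadings of one component can make the dominant variables easier to spot. The sketch below uses the PC1 column from the table above, rounded to four decimals; the names `pc1`, `share`, and `top5` are illustrative.

```r
# PC1 loadings copied from the table above, rounded to four decimals
pc1 <- c(koi_period = 0.3064, koi_duration = 0.0427, koi_depth = -0.1245,
         koi_prad = -0.1529, koi_teq = -0.4264, koi_insol = -0.1461,
         koi_model_snr = -0.1192, koi_steff = -0.2644, koi_slogg = 0.2561,
         koi_srad = -0.1703, koi_smass = -0.2964, koi_impact = -0.1905,
         koi_ror = -0.1709, koi_srho = 0.0677, koi_sma = 0.3080,
         koi_incl = 0.2533, koi_dor = 0.2990, koi_ldm_coeff1 = 0.2147,
         koi_ldm_coeff2 = -0.1651, koi_smet = 0.0624)

# A loading vector has unit length, so squared loadings sum to about 1
# and can be read as each variable's share of the component
share <- pc1^2 / sum(pc1^2)

# Five variables with the largest absolute loading on PC1
top5 <- sort(abs(pc1), decreasing = TRUE)[1:5]
names(top5)
round(100 * share[names(top5)], 1)
```

The same ranking can be produced for any other component by taking the corresponding column of `pca_result$rotation`.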
Visualize Loadings for PC1 and PC2
## [1] "Loadings Plot for PC1 vs PC2:"
fviz_pca_var(pca_result,
col.var = "contrib", # Color by contributions
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
Analysis of the component loadings revealed distinct patterns captured by the principal components.
PC1: high positive loadings for koi_period, koi_sma, koi_dor (larger orbits) and high negative loadings for koi_teq (cooler temperatures associated with larger orbits). Stellar properties (koi_slogg, koi_steff, koi_smass) also contribute moderately.
PC2: again loads on orbital size (koi_period, koi_sma, koi_dor) but also strongly incorporates stellar temperature (koi_steff positive loading) and limb darkening (koi_ldm_coeff1 negative, koi_ldm_coeff2 positive).
PC3: contrasts stellar radius and insolation (koi_srad, koi_insol positive) with stellar surface gravity (koi_slogg negative). Orbital size variables also contribute moderately.
PC4: dominated by koi_prad, koi_ror (planet/star radius ratio), and koi_impact.
PC5: dominated by koi_depth and koi_model_snr.
PC6: driven mainly by transit duration and stellar density (koi_duration, koi_srho). PC7 involves insolation and metallicity (koi_insol, koi_smet). PC19/PC20 seem to isolate specific period/axis relationships and limb darkening effects.
These interpretations suggest that the primary sources of variation in the dataset relate to the transit signal strength, stellar characteristics, transit geometry, and orbital properties.
Combine PCA results with the disposition information and plot the results.
pca_plot_data <- data.frame(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
Disposition = disposition_complete
)
autoplot(pca_result,
data = data.frame(pca_data_complete, Disposition = disposition_complete), colour = "Disposition",
loadings = TRUE, loadings.colour = "blue",
loadings.label = TRUE, loadings.label.size = 3
) +
labs(title = "PCA Plot with Loadings") +
theme_minimal()
fviz_pca_ind(pca_result,
geom.ind = "point", # show points only (but can use "text")
col.ind = disposition_complete, # color by groups
palette = "jco", # Journal color palette
addEllipses = TRUE, # Concentration ellipses
legend.title = "Disposition"
) +
ggtitle("PCA Plot of Individuals")
pca_scores_df_7 <- data.frame(pca_result$x[, 1:7], Disposition = disposition_complete)
ggpairs(pca_scores_df_7,
columns = 1:7, # Specify columns for the PC dimensions
aes(color = Disposition, alpha = 0.6), # Map color and transparency to Disposition
upper = list(continuous = wrap("cor", size = 3)), # Show correlation in upper panels
lower = list(continuous = wrap("points", size = 1)), # Show scatter plots in lower panels
diag = list(continuous = wrap("densityDiag", alpha = 0.5)), # Show density plots on diagonal
title = "Pairs Plot Matrix of First 7 Principal Components"
) +
theme_minimal() + # Apply a theme
theme(axis.text.x = element_text(angle = 45, hjust = 1))
First, let’s see the balance between the different dispositions in
the dataset using the pipeline disposition
(koi_pdisposition).
ggplot(koi_data %>% filter(!is.na(koi_pdisposition)), aes(x = koi_pdisposition, fill = koi_pdisposition)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) + # after_stat() replaces the deprecated ..count.. notation
labs(
title = "Distribution of Pipeline Dispositions",
x = "Pipeline Disposition (koi_pdisposition)",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none") # Hide legend as fill is redundant
This plot shows the number of KOIs classified as CANDIDATE vs. FALSE POSITIVE by the Kepler pipeline (within the loaded dataset, potentially after some filtering/NA removal). We can observe the relative balance between these classes, which is important context for model building and evaluation (e.g., calculating baseline accuracy). The classes appear reasonably balanced in this dataset.
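The baseline accuracy mentioned here is the accuracy of always predicting the most frequent class. A minimal sketch follows; the label vector is hypothetical and its counts are not taken from the KOI data.

```r
# Hypothetical labels for illustration only; these counts are NOT from the KOI data
labels <- c(rep("CANDIDATE", 48), rep("FALSE POSITIVE", 52))

class_counts <- table(labels)
class_share <- prop.table(class_counts)

# Always predicting the most frequent class gives the baseline accuracy
baseline_accuracy <- max(class_share)
majority_class <- names(class_share)[which.max(class_share)]

class_counts
majority_class
baseline_accuracy
```

On the real data, `labels` would presumably be `koi_data$koi_pdisposition` with missing values removed. A classifier is only useful if it clearly beats this number.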
Explore if planet size relates to the host star’s metallicity.
ggplot(
koi_data %>% filter(!is.na(koi_smet), !is.na(koi_prad), !is.na(koi_pdisposition), koi_prad > 0),
aes(x = koi_smet, y = koi_prad, color = koi_pdisposition)
) +
geom_point(alpha = 0.6, size = 1.5) +
scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30), labels = scales::label_number(accuracy = 0.1)) + # Planet radius often plotted on log scale
labs(
title = "Stellar Metallicity vs. Planetary Radius",
x = "Stellar Metallicity [Fe/H] (koi_smet)",
y = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
color = "Pipeline Disposition"
) +
theme_minimal() +
annotation_logticks(sides = "l") # Add log ticks to y-axis
This plot investigates whether larger planets tend to form around stars with higher metallicity (more heavy elements). While some studies suggest such a trend, especially for gas giants, it might not be strongly apparent here without statistical analysis. We can visually inspect if CANDIDATEs (blue) and FALSE POSITIVEs (red) occupy different regions or show different trends in this parameter space. False positives might appear across the metallicity range.
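One simple way to back the visual impression with a statistic is a rank correlation test. The sketch below is one possible approach rather than part of the original analysis, and it runs on simulated stand-in data: the variables `metallicity` and `radius` and the trend between them are invented for illustration.

```r
set.seed(42)

# Simulated stand-in data; the weak trend is built in on purpose
n <- 300
metallicity <- rnorm(n, mean = 0, sd = 0.2)
log_radius <- 0.3 + 0.8 * metallicity + rnorm(n, sd = 0.3)
radius <- 10^log_radius

# Spearman's rho is rank-based, so the log transform of radius does not change it
test_result <- cor.test(metallicity, radius, method = "spearman")

test_result$estimate
test_result$p.value
```

Because the test uses ranks, it needs no assumption of linearity and is insensitive to the heavy right tail of planetary radii. On the real data, the inputs would presumably be `koi_smet` and `koi_prad` after the same filtering as in the plot.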
Understand the frequency of different planet sizes.
ggplot(koi_data %>% filter(!is.na(koi_prad), koi_prad > 0), aes(x = koi_prad)) +
geom_histogram(bins = 50) + # Adjust binwidth/bins as needed
scale_x_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
labs(
title = "Distribution of Planetary Radii",
x = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
y = "Count"
) +
theme_minimal() +
annotation_logticks(sides = "b") # Add log ticks to x-axis
This histogram reveals the distribution of detected planet candidate sizes. We often expect to see peaks corresponding to common planet types (like super-Earths/mini-Neptunes around 1.5-4 Earth radii) and potentially a dip known as the “radius valley” or “Fulton gap” around 1.5-2 Earth radii, separating rocky super-Earths from gaseous mini-Neptunes. The distribution is heavily influenced by detection biases (larger planets are easier to find).
Understand the frequency of different orbital periods.
ggplot(koi_data %>% filter(!is.na(koi_period), koi_period > 0), aes(x = koi_period)) +
geom_histogram(bins = 50) + # 50 bins; adjust bins/binwidth as needed
scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000)) +
labs(
title = "Distribution of Orbital Periods",
x = "Orbital Period [Days] (log scale)",
y = "Count"
) +
theme_minimal() +
annotation_logticks(sides = "b") # Add log ticks to x-axis
This histogram shows that the vast majority of detected KOIs have short orbital periods (typically less than 50-100 days). This is largely due to detection bias: planets with shorter periods transit more frequently, making them easier to detect in the fixed duration of the Kepler mission.
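The detection bias can be made concrete by counting how many transits fit into a fixed observing window. The sketch below assumes a round four-year baseline purely for illustration and ignores data gaps; `baseline_days`, `period_days`, and `approx_transits` are illustrative names.

```r
# Assumed observing baseline in days (a round four-year figure, for illustration)
baseline_days <- 4 * 365

# Example orbital periods in days
period_days <- c(1, 10, 100, 500, 1000)

# Approximate number of transits that fit in the baseline
# (the exact count depends on the phase of the first transit)
approx_transits <- floor(baseline_days / period_days)

data.frame(period_days, approx_transits)
```

Under these assumptions a 10-day planet transits on the order of a hundred times, while a 1000-day planet transits only once or twice, which leaves far less signal to stack.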
A classic plot in exoplanet studies, often revealing distinct populations. Color by disposition.
ggplot(
koi_data %>% filter(!is.na(koi_prad), !is.na(koi_period), koi_prad > 0, koi_period > 0, !is.na(koi_pdisposition)),
aes(x = koi_period, y = koi_prad, color = koi_pdisposition)
) +
geom_point(alpha = 0.5, size = 1.5) + # Adjust alpha/size
scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000)) +
scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
labs(
title = "Orbital Period vs. Planetary Radius",
x = "Orbital Period [Days] (log scale)",
y = "Planetary Radius [Earth Radii] (log scale)",
color = "Disposition" # Legend title; color maps to the pipeline disposition (koi_pdisposition)
) +
theme_minimal() + # Or other themes
annotation_logticks(sides = "lb") # Add log ticks to both axes
This fundamental plot shows planet radius against orbital period. We can identify known exoplanet populations: Hot Jupiters (large radius, short period; top left), potentially a “Neptunian desert” (a region with fewer Neptune-sized planets at very short periods), and the bulk of smaller planets (super-Earths/mini-Neptunes). Coloring by pipeline disposition helps visualize where candidates and false positives lie. False positives might cluster in certain areas (e.g., very large radii suggesting eclipsing binaries) or be scattered throughout.
Explore potential atmospheric regimes based on stellar energy received.
ggplot(
koi_data %>% filter(!is.na(koi_prad), !is.na(koi_insol), koi_prad > 0, koi_insol > 0, !is.na(koi_pdisposition)),
aes(x = koi_insol, y = koi_prad, color = koi_pdisposition)
) +
geom_point(alpha = 0.5) +
scale_x_log10() + # Insolation often spans orders of magnitude
scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
labs(
title = "Insolation Flux vs. Planetary Radius",
x = "Insolation Flux [Earth Flux] (log scale)",
y = "Planetary Radius [Earth Radii] (log scale)",
color = "Disposition"
) +
theme_minimal() +
annotation_logticks(sides = "lb") # Add log ticks to both axes
This plot relates the amount of energy a planet receives from its star to its size. High insolation can affect planetary atmospheres (e.g., photo-evaporation potentially contributing to the radius valley). We can examine if candidates and false positives separate based on these parameters. For instance, highly irradiated large objects might be more likely to be false positives (binaries).
Explore the relationship between the measured transit depth and its signal-to-noise ratio.
ggplot(
koi_data %>% filter(!is.na(koi_depth), !is.na(koi_model_snr), koi_depth > 0, koi_model_snr > 0, !is.na(koi_pdisposition)),
aes(x = koi_depth, y = koi_model_snr, color = koi_pdisposition)
) +
geom_point(alpha = 0.5) +
scale_x_log10() +
scale_y_log10() +
labs(
title = "Transit Depth vs. Model Signal-to-Noise Ratio",
x = "Transit Depth [ppm] (log scale)",
y = "Transit Signal-to-Noise Ratio (log scale)",
color = "Pipeline Disposition"
) +
theme_minimal() +
annotation_logticks(sides = "lb")
As expected, there is a strong positive correlation between transit depth and SNR: deeper transits are easier to detect with higher confidence. This plot helps visualize if false positives tend to cluster at lower SNRs or specific depths. Some false positives might have high SNR but other characteristics (like V-shaped transits, not shown here) that disqualify them. Candidates span a wide range of depths and SNRs.
Compare the distributions of important numeric variables between CANDIDATEs and FALSE POSITIVEs.
# Example: Orbital Period
p1 <- ggplot(
koi_data %>% filter(!is.na(koi_period), !is.na(koi_pdisposition), koi_period > 0),
aes(x = koi_pdisposition, y = koi_period, fill = koi_pdisposition)
) +
geom_boxplot(outlier.shape = NA) + # Hide outliers for clarity on main distribution
scale_y_log10(limits = c(NA, quantile(koi_data$koi_period, 0.99, na.rm = TRUE))) + # Zoom y-axis, adjust quantile if needed
labs(y = "Orbital Period (log)", x = "Disposition") +
theme_minimal() +
theme(legend.position = "none")
# Example: Planetary Radius
p2 <- ggplot(
koi_data %>% filter(!is.na(koi_prad), !is.na(koi_pdisposition), koi_prad > 0),
aes(x = koi_pdisposition, y = koi_prad, fill = koi_pdisposition)
) +
geom_boxplot(outlier.shape = NA) +
scale_y_log10(limits = c(NA, quantile(koi_data$koi_prad, 0.99, na.rm = TRUE))) +
labs(y = "Planetary Radius (log)", x = "Disposition") +
theme_minimal() +
theme(legend.position = "none")
# Example: Transit Duration
p3 <- ggplot(
koi_data %>% filter(!is.na(koi_duration), !is.na(koi_pdisposition), koi_duration > 0),
aes(x = koi_pdisposition, y = koi_duration, fill = koi_pdisposition)
) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = c(NA, quantile(koi_data$koi_duration, 0.99, na.rm = TRUE))) + # May not need log scale
labs(y = "Transit Duration", x = "Disposition") +
theme_minimal() +
theme(legend.position = "none")
# Example: Transit SNR
p4 <- ggplot(
koi_data %>% filter(!is.na(koi_model_snr), !is.na(koi_pdisposition), koi_model_snr > 0),
aes(x = koi_pdisposition, y = koi_model_snr, fill = koi_pdisposition)
) +
geom_boxplot(outlier.shape = NA) +
scale_y_log10(limits = c(NA, quantile(koi_data$koi_model_snr, 0.99, na.rm = TRUE))) +
labs(y = "Model SNR (log)", x = "Disposition") +
theme_minimal() +
theme(legend.position = "none")
# Show plots sequentially if packages not loaded/preferred
print(p1 + labs(title = "Period Distribution"))
These boxplots compare the central tendency (median) and spread (interquartile range) of key variables between pipeline CANDIDATEs and FALSE POSITIVEs. Significant differences in the distributions suggest a variable might be a good discriminator between the classes. For example, we might observe that FALSE POSITIVEs tend to have larger median radii or perhaps shorter durations compared to CANDIDATEs, although overlap is expected. Variables showing clear separation are likely important features for predictive models. (Note: axis limits are adjusted to focus on the bulk of the distribution, hiding extreme outliers for visual clarity of the boxes.)
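The medians and interquartile ranges that the boxplots display can also be tabulated, which makes group differences easier to quote. The sketch below runs on simulated stand-in data: the data frame `toy` and the offset between its two groups are invented for illustration and say nothing about the real KOI distributions.

```r
set.seed(7)

# Simulated stand-in data; the offset between groups is built in on purpose
n <- 400
disposition <- rep(c("CANDIDATE", "FALSE POSITIVE"), each = n / 2)
log_radius <- rnorm(n, mean = ifelse(disposition == "CANDIDATE", 0.4, 1.0), sd = 0.5)
toy <- data.frame(disposition, radius = 10^log_radius)

# Quartiles per group: one row per disposition, columns 25%, 50%, 75%
by_group <- split(toy$radius, toy$disposition)
summary_table <- t(sapply(by_group, quantile, probs = c(0.25, 0.5, 0.75)))

# Interquartile range as a simple measure of spread
iqr <- summary_table[, "75%"] - summary_table[, "25%"]

summary_table
iqr
```

These values should be close to, though not necessarily identical with, what a plotted box shows, since plotting on a log scale and clipping the axis can both shift the drawn quartiles slightly.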